Gate Level Synthesis

Lecture Slides

Here are the lecture slides from the previous semester:

Download Here

The Synthesis Flow

After studying the formal description of a logic circuit using a Hardware Description Language, we can look at how this description will be mapped to a real hardware circuit.

A few concepts, which have a direct impact on what the logic will look like, have already been introduced and practised:

  • Defining Clock Synchronous logic blocks
  • Defining pipeline stages by using “non-blocking” value updates
  • Defining a reset to bring a circuit into an initial state

As we mentioned in the introduction to this lecture, we are following a so-called “Semi-Custom” design flow.

This means that the circuit implementation is automated, based on a provided set of ready-to-use logic functions. The algorithms then only translate the HDL description to a logic network whose individual functions are available as physical components to be used in the final circuit.

When starting with synthesis, the user should gather all the components needed to make the system physically viable, called resources, and prepare all the specification constraints.

The synthesis process itself is thus not very complex. It consists of:

  • Loading all the component definitions: Resources and Design sources (Verilog/VHDL)
  • Elaborating the design into a formal representation of the hierarchy and logic
  • Loading the constraints for timing
  • Mapping the design to target logic using the provided logic cells
  • Analyzing the timing reports to make sure the design is feasible.
  • Fixing any timing issues and starting over, or saving the resulting circuit.

Resources

The resources are all the functional blocks which are not described using the behavioural model, but are required for a physical implementation to be viable.

Some of the most common resources are going to be:

  • Memories like SRAM
  • Phase-Locked Loops (PLL) for clock multiplication and division
  • General Intellectual Property blocks (IP blocks): any kind of design part which can be bought from a third-party provider

As a matter of fact, Standard Cells are also resources, but we will present them in a later part: Standard Cells.

Clock Management using PLL

A simple example of a useful resource in a circuit is a clock manager, built around components like:

  • Phase-Locked Loops
  • Clock dividers
  • Phase aligners

Indeed, digital systems rarely work using one single clock. It is very common for various sub-systems to use their own clock: some may be very fast and process narrow data paths, others slower but processing wider data buses every cycle.

When building an integrated circuit, the designer must be aware of the required clocks and make sure the circuit has a means to produce them.

A fractional PLL is a typical building block to provide clocks on an integrated circuit, because it has the advantage of being configurable. The system on chip can then easily offer multiple function modes to the user, like in microcontrollers, where clocking can be fine-tuned to reach the desired performance while saving power.

On a custom circuit, using a PLL will require:

  • A stable, low-jitter reference clock, which is usually slow: a quartz oscillator, for example
  • A PLL block, which can be bought or designed
  • Logic to offer configurability to the user, with various combinations of clock multipliers and dividers
../../_images/pll.png

SRAM Memories

External and Local Storage

Memories can easily be modeled in Verilog using a two-dimensional array of M words, each N bits wide.

reg [N-1:0] data [0:M-1];
../../_images/regmem.png

However, memories are usually wide and deep in most systems: the data width, in network applications for example, will usually be at least 64 bits, and the depth is adapted to improve throughput.

If a memory is defined with a depth of 1024 and a width of 32 bits (4 Kbytes), a total of about 32 thousand flip-flops will be generated. Moreover, a memory is usually accessed in a random way, which means that all the bits in a row are selected by an address and hence become part of a selection multiplexer:

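input [ADDRWIDTH-1:0] address;

// 2**ADDRWIDTH words of N bits each
reg [N-1:0] data [0:2**ADDRWIDTH-1];

always @(posedge clk) begin
   ...
   // "output" is a reserved word in Verilog, so the read register is named data_out here
   data_out <= data[address];
   ...
end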

In that case, the area cost of the flip-flops would be very high, and the heavily loaded multiplexer would be slow. That is why it is important to replace register arrays by RAM memories, which are optimised for high speed and high density.
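The figures quoted above can be checked with a short calculation. The depth and width are the ones from the text; the multiplexer count is an added observation that assumes a single read port returning one full word per access.

```latex
\begin{aligned}
\text{storage bits} &= 1024 \times 32 = 32768 \approx 32\ \text{thousand flip-flops} \\
\text{size} &= 32768 / 8 = 4096\ \text{bytes} = 4\ \text{Kbytes} \\
\text{read path} &= 32\ \text{multiplexers with}\ 1024\ \text{inputs each}
\end{aligned}
```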

Using RAM memories in a design is usually realised in two ways:

  • For large storage, one can use an external device, like in standard computers. The user will need to acquire or write a controller for the external device.
../../_images/memddrcontroller.png
  • For internal or local storage, like data buffers in the form of First-in First-out (FIFO) blocks for example, a small SRAM represented by a Verilog module can be used.
../../_images/memfifosram.png

In this example, we represent a data processor of some kind, which produces data blocks and stores them in a FIFO, from which they are transmitted by an external output module. The FIFO needs local data storage, in the form of an SRAM.

Local SRAM Generators

RAM memories, being highly optimised, are not synthesised from a logic behaviour. There is also no standard way for a vendor tool to generate RAMs from Verilog/VHDL during synthesis.

Typically then, one can buy a RAM generator tool, which is proprietary software used to generate the physical description of the RAM block, along with an empty Verilog module to instantiate the memory.

The generator will also provide a performance evaluation of the desired block, so that the user can decide which are the best variants to use.

RAM Size   Physical Size   Speed of first pin
512x32     231x104 µm²
1024x64    572x171 µm²

Static Timing Analysis

Static Timing Analysis, or STA for short, is the central component of the whole circuit synthesis flow.

Indeed, we have defined our basic logic entity in a circuit as being the clock synchronous Pipeline Stage. The whole point of the automated synthesis and place and route processes is to calculate the time cost of the implemented logic, so that we can ensure a correct and stable run over the device’s life.

Therefore, the timing analysis is simply presented as being:

  • A sum of time costs for the various gates and wires from a value’s start point (input, register) to an endpoint (output, register)
  • A comparison to the target maximal allowed time, which is the clock period, giving a positive or negative slack.
../../_images/sta-startend.png

Timing Paths: From Start to End with respect to the target time

Time Costs

We have just defined timing as a sum of costs. These have two main origins:

  • Tgate: Gate propagation delay. The time the data needs to transfer from the input to the output.
  • Trc: Interconnection delay. The time required by the signal to travel on a wire, modeled as a resistor and capacitor pair.
../../_images/sta-trc.png

Wire as a Resistor and Capacitance depending on Fanout

The gate delay is known and provided by the logic function libraries, in which the timing parameters have been characterised. The interconnection delay, however, depends on the circuit, so it won’t be known right away.
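The sum-and-compare definition of timing analysis can be written compactly as follows. This is only a sketch: Tgate and Trc are the per-element costs defined above, while the symbols T_path, T_target and the index i over the elements of one timing path are labels introduced here for illustration.

```latex
\begin{aligned}
T_{path} &= \sum_{i} \left( T_{gate,i} + T_{rc,i} \right) \\
\text{slack} &= T_{target} - T_{path}
\end{aligned}
```

A positive slack means the path completes within the target time; a negative slack means it does not.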

Wireload and Extraction Wire timing

In the earlier days of circuit synthesis, the wiring delay was based on statistical models called Wireload models. These models would provide typical delays based on wire length, fanout, etc.

However, as the transistor gate length kept shrinking and transistor speed increased, the interconnection cost became too dominant, and too sensitive to physical phenomena like crosstalk, to be accurately rendered by a statistical model.

The following picture shows the rough cost of interconnection and gate delay, depending on the process size. The cross-over occurs at a technology of around 180nm transistor gate length.

../../_images/sta-netgatedelay.png

(no source)

Thus, modern tools abandoned the statistical approach in favor of real physical extraction.

This means that synthesis is usually performed in two phases:

  • First with ideal wiring delays
  • Then, once the physical layout is known, with integrated fast wire extraction.

Timing Parameters

When analysing Timing, each cost is represented by a timing information based on two types of parameters:

  • The Delay is a propagation time through an element, gate or wire
  • The Slew is the rise or fall time at a certain input or output.
../../_images/sta-slewdelay.png

These two parameters are correlated: a long slew on a wire induces a long delay. The tool uses them in combination to optimise the timing results.

Typically, the delay of a logic cell will depend on the slew at its inputs, so the delay can be tuned by controlling the slew, which in turn depends on the wiring and output slew of the previous cell.

These parameters are important to understand because they appear everywhere in reports and documentation, but they are not really tuned by hand during design.

Timing Checks: Setup and Hold

Since we have defined timing analysis as a function of a given time cost against a target timing, we can define the result of the analysis as being a timing check.

We have also defined the timing paths between start and end points. Obviously the start points add to the time cost, and the end points define the time target to be checked against.

The end points have been defined as being either:

  • Flip-Flop inputs, which have timing constraints.
  • Output ports, which are just a physical output with no timing meaning other than the user defined (see Input and Output Delays).

The internal model of a Flip-Flop has been presented in a previous lecture, using a two-capacitors model. The following picture shows the state of the switches depending on the clock state:

../../_images/sta-acctransfer.png

“Save and transfer” representation of the flip flop operation

We can try to draw the same picture to show the dynamics at the clock edge:

../../_images/sta-ffedge.png

Switching around the clock edge

Using those two views, we can understand that around the clock edge, the input has to be stable, to guarantee that an input change does not overlap with the time the switches need to operate and the capacitance needs to settle.

Any overlap could cause the circuit to enter a metastable state, which means the sampled input of the Flip-Flop could have an undetermined value, hence introducing an error in the logic’s result.

We can now define two timing checks:

  • Setup which is the dead time before the clock edge, or latest arrival.
  • Hold which is the dead time after the clock edge, or earliest departure.

Hint

In the case of inputs or outputs, the setup or hold times are unknown, so the design must be constrained manually to define the time to spare from the clock period, held in reserve for the potential next flip-flop.

Timing Corners

To summarize, we have stated that timing analysis consists of summing the time costs of gates and wires between a start point and an endpoint, that is a Timing Path, comparing the result to a given period of time, and verifying it against a latest or earliest time.

The keywords latest and earliest tell us here that we can distinguish two different cases:

  • The latest time is given when the costs are the highest.
  • The earliest time is given when the costs are the lowest.

These sets of costs are called corners. The timing corners in a system are all the physical configurations of the circuit which might change the time cost of the logic. Typically, these configurations are:

  • Temperature
  • Supply Voltage
  • Speed of the logic: Slow, fast, typical
  • Advanced concepts dependent on the technology, like On-Chip Variations.

The logic speed is an important characterisation aspect. Usually, the corners are named as follows:

  • SS for slow, or WC for Worst Case
  • TT for typical, or TC for Typical Case
  • FF for fast, or BC for Best Case

The key concept to remember about corners is simply that the setup analysis is performed on slow corners, which give the latest arrival times, and the hold analysis on fast corners, which give the earliest departure times.

Setup and Fixes

The Setup Time analysis thus calculates the worst times and returns:

  • A positive slack if the time is under the clock period minus the setup time.
  • A negative slack if the time exceeds the clock period minus the setup time.

The following picture is an illustration of the Setup timing analysis. You can find another picture in the lecture slides.

../../_images/sta-setup.png

Illustration of setup calculation for Flip-Flop to Flip-Flop, or from/to a simple input/output.

In this picture, we can note that the timing costs at the inputs are described as fixed. Indeed, the transfer time at the clock edge in the flip-flop should be taken into account. This cost is named Clock To Output for the register, or input delay in the case of an undetermined input.
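The setup check described in this section can be sketched as an equation. The symbols are labels chosen here: T_clk is the clock period, T_setup the setup time of the capturing flip-flop, T_cq the Clock To Output cost of the launching flip-flop and T_path the summed gate and wire delays. An ideal clock without skew is assumed, and the numbers in the worked example are purely illustrative.

```latex
\begin{aligned}
\text{slack}_{setup} &= \left( T_{clk} - T_{setup} \right) - \left( T_{cq} + T_{path} \right) \\
\text{example:}\quad \text{slack}_{setup} &= (10 - 0.3) - (0.5 + 7.2) = 2.0\ \text{ns}
\end{aligned}
```

With the same illustrative values but a path delay of 9.5 ns, the slack would become (10 - 0.3) - (0.5 + 9.5) = -0.3 ns, which is a setup violation.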

In case of negative slack, a positive slack can be reached by:

  • Optimising the used logic by enabling faster cells for example, or reducing the logic complexity by changing the reset method or enabling clock gating.
  • Reducing the clock frequency obviously
  • In the worst case, redesigning the processing by breaking down the logic into more pipeline stages. This introduces a lot of changes in the logic design however.
  • Some other advanced methods are used by tools, like Clock Skewing, where the clock is delayed to gain time, but the next stage loses time.
../../_images/sta-setupfix.png

Some illustrations of possible Setup Fixes

Hold and Fixes

The hold analysis then looks at the fastest behaviour of the cells and ensures the time is greater than the earliest departure, or hold time:

  • A positive slack is given if the timing is larger than the hold time.
  • A negative slack is given if the timing is smaller than the hold time.
../../_images/sta-hold.png

Hold added to Setup timing, as minimal time required by the logic to complete.
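The hold check can be sketched in the same way as the setup check. The symbols are labels chosen here: T_cq,min and T_path,min are the fastest Clock To Output and path delays, and T_hold is the hold time of the capturing flip-flop. An ideal clock without skew is assumed, and the values are illustrative only.

```latex
\begin{aligned}
\text{slack}_{hold} &= \left( T_{cq,min} + T_{path,min} \right) - T_{hold} \\
\text{example:}\quad \text{slack}_{hold} &= (0.2 + 0.1) - 0.4 = -0.1\ \text{ns} \\
\text{with a } 0.15\ \text{ns buffer:}\quad \text{slack}_{hold} &= (0.2 + 0.1 + 0.15) - 0.4 = 0.05\ \text{ns}
\end{aligned}
```

The last line shows why adding delay on the data path repairs a hold violation: the data simply departs later than the hold window.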

Fixing the Hold Time is rather easy and offers limited options:

  • Slow down the logic by adding buffers before the Flip-Flop.
  • Skew the clock by making the logic start later, at the cost of a worse setup time
../../_images/sta-holdfix.png

Warning

Hold timing must always be fixed, because once the circuit is produced, it is difficult to make the logic slower. Setup problems, on the contrary, can be worked around after production by slowing down the clock.

Asynchronous Clock Domains

It is very common for digital systems to feature multiple clock domains.

A clock domain groups all the Flip-Flops which are driven by one clock, and is usually considered independent of the other clock domains. This means that the timing analysis should stop at clock domain boundaries, since the relationship between the clock edges is unknown.

../../_images/sta-notiming.png

No timing analysis at clock domain boundaries

Asynchronous clock domains are found for example where a low speed clock domain processes wide data paths, and transfers the data to be transmitted by a high speed clock domain which serialises the result.

Standard Cells

Standard Cells are the low-level logic functions made available to the tools to map the design.

The standard cell library mainly provides cells of these types:

  • Logic functions (AND, OR, XOR, ANDOR etc…)
  • Special Cells like Tie High and Tie Low (for constant 1 and 0), or technology related physical cells.
  • Input/Output cells to interface to the outside world and perform voltage translation.

The Standard Logic cells are provided with a common physical layout:

  • Same height
  • Power Lines (VDD and VSS) on top and bottom
../../_images/stdcell-one.png

A standard cell overview with I/O and power lines

This way, the cells can be organised in rows and abutted next to each other.

../../_images/stdcell-rows.png

A few cells dispatched into rows.

The standard cell library then also provides for each logic cell, different variants to be chosen by the tool, for example:

  • Multiple Driving Strengths: Depending on the fanout of a logic function, the output transistor might have to be chosen larger to be able to drive the output capacitance in a reasonable time.
  • Multiple Thresholds: The transistors can have a low, normal or high threshold voltage. Lower thresholds are faster but have a much higher leakage current.

The following picture is taken from the UMC-65nm databook and shows the naming convention of the cells, which is typical among technology vendors:

../../_images/stdcell-naming.PNG

Cell Strengths

The cells come in various strengths to enable driving more or fewer gates. The stronger the cell, however, the more area and leakage it costs; in exchange, its delay decreases.

The following table is extracted from the Library definition:

Parameter           AND3X1   AND3X2   AND3X4   AND3X6   AND3X8   AND3XL
Area (µm²)          2.39     2.74     3.08     4.45     5.13     2.05
Mean Leakage (pW)   94.09    128.83   233.29   385.43   458.10   81.49
Timing A->Z (ns)    0.54     0.37     0.31     0.23     0.20     0.71

Hint

The Data is from a Testing Design Kit based on an IBM 45nm technology

Cells Timing Corners

Library                          Area (µm²)   Mean Leakage (pW)   Timing A->Z (ns)
fast_vdd1v0_basicCells.lib       2.394        94.09               0.54
fast_vdd1v0_basicCells_hvt.lib   2.394        94.78               0.84
fast_vdd1v0_basicCells_lvt.lib   2.394        607.53              0.44
slow_vdd1v0_basicCells.lib       2.394        33.74               1.76
slow_vdd1v0_basicCells_hvt.lib   2.394        22.62               2.48
slow_vdd1v0_basicCells_lvt.lib   2.394        190.57              1.20

Design Constraints

The design constraints are the formal translation of the system specifications, which can be read by the toolchain to drive the synthesis and produce reports. The constraints mostly target timing specifications, but can also be used for advanced concepts like multiple power domain specifications or physical placement, depending on the target technology and tool vendor.

We will focus here on timing specifications which are globally standard for all technologies.

Timing specifications are usually written in a machine-readable format, like a configuration file. Some tools may have a proprietary format, but we will illustrate constraints here using the standard SDC format, which is used in the exercise sessions.

Warning

The commands presented may not be exactly the same as the ones used by a given tool vendor. Focus on the purpose of each constraint to adapt it to a specific project.

Clock Definition

Defining clocks is the first step since, as we have seen, the clock period defines the target time.

The SDC command is called create_clock and requires:

  • A Period
  • A target pin or port to apply to
  • A name

Multiple clocks can be defined in a system, but they should be considered independent of each other.
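As a minimal sketch, a clock definition with the three elements listed above could look like this. The clock name core_clk, the port name clk and the 10 ns period are assumptions made for illustration, and the time unit depends on the library in use.

```tcl
# Hypothetical clock: the name (core_clk), the port (clk) and the period are assumptions.
# The period is expressed in the library time unit, assumed here to be ns (10 ns = 100 MHz).
create_clock -name core_clk -period 10.0 [get_ports clk]
```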

Overconstraining using uncertainty

We have seen that the time cost is made up of gate and wire delays. While gate delays are well known, wire delays may be uncertain until the physical layout of the design is fully known.

../../_images/constr-uncertainty.png

Cost uncertainty for wiring.

It then makes sense to overconstrain a design in its early stages, to spare time from the clock period.

This can easily be done using the clock uncertainty commands, which basically remove available time for setup timing and add minimal time for hold timing.

How much uncertainty should be added is more or less empirical, but a start value of 10 to 20% of the clock period can be used for a 65nm technology, for example. If the target technology size is suspected to carry a heavier load on wiring, this value should be increased.
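A possible way to express such a margin in SDC is the set_clock_uncertainty command. The sketch below assumes a hypothetical 10 ns clock named core_clk and picks 15% of the period for setup; the hold value is purely illustrative.

```tcl
# Hypothetical clock used for this sketch (name, port and period are assumptions)
create_clock -name core_clk -period 10.0 [get_ports clk]

# Remove 15% of the 10 ns period (1.5 ns) from the time available for setup checks
set_clock_uncertainty -setup 1.5 [get_clocks core_clk]

# Add a small margin to the hold requirement (illustrative value)
set_clock_uncertainty -hold 0.1 [get_clocks core_clk]
```

Once the layout is known and wire delays are extracted, these margins would typically be reduced.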

Generated Clocks

Clocks can be dependent on each other, meaning that one clock can have a relationship to another based on a multiplication or division factor. If the clocks are dependent, they will be treated as one clock domain and timed together.

Generated clocks should be used carefully though, and the user should be certain that the relationship between the clocks is known and won’t drift over time.

The SDC command is called create_generated_clock and requires:

  • The source clock name
  • The Multiplication or division factor
  • The target pin or port to apply to
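A sketch of a divide-by-two generated clock is shown below. All names are hypothetical: core_clk and its port clk, the derived clock core_clk_div2, and the divider flip-flop clk_div_reg. Note that in standard SDC the -source option takes the pin or port on which the source clock is defined.

```tcl
# Hypothetical source clock (name, port and period are assumptions)
create_clock -name core_clk -period 10.0 [get_ports clk]

# Clock derived by a divide-by-2 flip-flop; clk_div_reg/Q is an assumed pin name
create_generated_clock -name core_clk_div2 \
    -source [get_ports clk] \
    -divide_by 2 \
    [get_pins clk_div_reg/Q]
```

A multiplication factor would be given with -multiply_by instead of -divide_by.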

Modes

Timing Modes are used to describe different use cases of a design, for which the constraints might change. The typical use case for modes arises when some logic paths can respond to different clocks depending on the use case.

Usually designs have two modes:

  • A “main” mode, with the target constraints of the application
  • A “test” mode, where some flip-flops can be driven by a slow clock and bits injected into the logic to verify correctness.

The testing mode is usually implemented as a JTAG chain, which is used after production to verify the circuit is functional.

False Paths

False paths are used to define paths which are to be excluded from the timing calculation. We have seen this case for multiple clock domains, where paths at clock domain boundaries should not be timed.

Depending on the tool used, false paths may be created automatically at clock boundaries, but it is safer to define them manually using the set_false_path command.

The same case may apply to an asynchronous reset. These resets are usually meant to be asserted for a long time, and deactivate the whole system, clocks included. In this case, timing is useless on these paths, so a false path can be set.
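The two cases above could be written as follows. This is a sketch under assumptions: the clock names clk_a and clk_b, their periods and the reset port name res_n are hypothetical.

```tcl
# Two hypothetical, unrelated clocks
create_clock -name clk_a -period 10.0 [get_ports clk_a]
create_clock -name clk_b -period 4.0  [get_ports clk_b]

# Exclude the paths crossing the clock domain boundary, in both directions
set_false_path -from [get_clocks clk_a] -to [get_clocks clk_b]
set_false_path -from [get_clocks clk_b] -to [get_clocks clk_a]

# Exclude the paths starting at an asynchronous reset port (assumed name)
set_false_path -from [get_ports res_n]
```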

Multicycle Paths

Multicycle paths are a timing relaxing feature.

Some logic paths, like configuration bits, are rarely changed, or are even only changed when the design is disabled, because they might change its behaviour.

These paths don’t need to propagate within one clock cycle, but they still have to be timed so that they arrive before the setup time. Relaxing them avoids useless timing violations.

The command set_multicycle_path can be used to allow these special paths to use more than one clock cycle.
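As an illustration, the sketch below gives two clock cycles to paths starting at hypothetical configuration registers whose names start with cfg_reg. The second command follows the common practice of moving the hold check back by one cycle when the setup check is relaxed; whether it is needed should be checked against the documentation of the tool in use.

```tcl
# Allow paths starting at the (assumed) configuration registers to take 2 clock cycles
set_multicycle_path 2 -setup -from [get_cells cfg_reg*]

# Keep the hold check at the launch edge (setup multiplier minus one)
set_multicycle_path 1 -hold -from [get_cells cfg_reg*]
```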

Input and Output Delays

Input and output delays are used to remove time from the clock period, to ensure that the delays caused by external components, or reserved for them, are taken into account.

It is the same concept as the timing between two registers, but in the case of input or output, we only have one side of the logic:

  • The input port to the first logic
  • The last flip-flop in the pipeline to the output port

The following picture summarizes this case, where two modules could be connected after the circuit has been built, thus requiring some register-to-register time, which is offered by a conservative input/output delay:

../../_images/input-output-register.png

In a more general way, the input and output delays are set in the following manner:

../../_images/input-output-delay.png

The commands in the SDC standard are called set_input_delay and set_output_delay. Their actual usage might vary depending on the tool vendor; we will use them with the Cadence toolchain during the lab work.
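A minimal sketch of both commands is given below, assuming a hypothetical 10 ns clock core_clk and ports whose names start with data_in and data_out. The delay values are illustrative budgets, not recommendations.

```tcl
# Hypothetical clock (name, port and period are assumptions)
create_clock -name core_clk -period 10.0 [get_ports clk]

# The external logic is assumed to consume 4 ns before the data reaches the input ports,
# leaving 6 ns of the period for the internal input logic
set_input_delay -clock core_clk -max 4.0 [get_ports data_in*]

# 3 ns are reserved for the external logic and the setup time of the next flip-flop,
# leaving 7 ns for the internal output logic
set_output_delay -clock core_clk -max 3.0 [get_ports data_out*]
```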

Input Driver and output Load

Similar to the input and output delay allowing to spare clock time for the unknown logic delays, the input driver and output load can be specified to define the signal rise or fall time at the input or outputs.

Indeed, the input slew has an impact on the speed of the input logic, and depending on the output load, a cell of a certain strength may be needed to meet the timing requirements.

../../_images/input-output-load.png

The SDC constraints to be used are (look at the tool documentation for precise syntax):

  • At the inputs:
    • use set_driving_cell to specify a cell from the technology library whose parameters will be used for the input slew
    • use set_drive to specify a resistance at the input, if specifying a library cell does not apply
  • At the outputs:
    • use set_load to specify the capacitance the output has to drive
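The three constraints could be used as sketched below. The cell name BUFX2 and all port names are assumptions chosen for illustration; the units of the resistance and capacitance values depend on the technology library.

```tcl
# Inputs driven by an assumed buffer cell from the technology library
set_driving_cell -lib_cell BUFX2 [get_ports data_in*]

# Alternative for an input where no library cell applies: drive resistance (library units)
set_drive 1.0 [get_ports ext_in]

# Capacitance that each output has to drive (library units)
set_load 0.05 [get_ports data_out*]
```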

Synthesis Output

The synthesis process simply transforms the abstract logic description in Verilog into a Verilog file containing instances of the logic cells from the standard library, represented as modules:

 // DFF cells are flip-flops
 // Here a shift register; note the CK connection
 // D and Q are not directly connected because of the reset logic
 DFFRHQX4 \johnson_value_reg[1] (.RN (res_n), .CK (clk), .D (n_820),
   .Q (n_95));
 DFFRHQX4 \johnson_value_reg[2] (.RN (res_n), .CK (clk), .D (n_821),
      .Q (n_91));
 DFFRHQX4 \johnson_value_reg[0] (.RN (res_n), .CK (clk), .D (n_482),
      .Q (n_96));

// Some logic Cells
//------------
OAI21X4 g3060(.A0 (n_502), .A1 (n_547), .B0 (n_326), .Y (n_481));
OAI2BB2X2 g3061(.A0N (n_239), .A1N (n_425), .B0 (n_318), .B1 (n_644),
   .Y (n_482));

Advanced Topics

To conclude this chapter, we can analyse how the logic is synthesised for two specific cases which are easy to use in the hardware description:

  • Reset Type: Asynchronous or Synchronous
  • Clock Gating

Reset Type

We have presented in the section about Verilog how to define a synchronous or asynchronous reset. These two reset types are easy to understand:

  • A Synchronous Reset is mapped with the clocked logic, as part of the logic
  • An Asynchronous Reset is not dependent on the clock, and is mapped to the Set/Reset port of Set/Reset Flip-Flops

Async Reset:

// ASYNC Reset
always @(posedge clk or posedge reset) begin
   if(reset)
      value <= 0;
   else
      value <= value + 1;
end

Sync Reset:

// SYNC Reset
always @(posedge clk) begin
   if(reset)
      value <= 0;
   else
      value <= value + 1;
end
../../_images/reset.png

Clock Gating

Clock gating is a common optimisation method that can be enabled during synthesis to move logic that clearly enables a value change of a register, from the data path, to the clock line.

Indeed, this kind of logic is usually called “enable logic”, as it allows bringing the logic to a standstill. When a logic circuit is disabled and its values won’t change, it behaves as if no clock was toggling at the register inputs.

The synthesis tools can detect the enabling logic, and move it to disable the clock line:

always @(posedge clk) begin
   if(reset)
      value <= 0;
   else if (enable)
      value <= value + 1;
end
../../_images/cgating.png

Clock gating can be enabled for most designs, as it can significantly reduce power consumption during periods of inactivity. However, it introduces logic on the clock path, meaning the clock delay for each pipeline stage will not be constant, and it can introduce timing issues for critical logic paths (see setup and hold clock delay pictures).